Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial

MLDB provides a complete implementation of the SQL SELECT statement. Most of the functions you are used to using are available in your queries.

MLDB also supports additional functions that extend standard SQL in very interesting ways. One of those function is the jseval function that can be used to execute arbitrary JavaScript code inline in an SQL query.

In this tutorial, we will show some basic usage example followed by two different use-cases for the jseval function:

  • Formatting data during the import step
  • Designing custom feature generators

Setting up

Before we begin, let's start by importing the pymldb library so we can make REST API calls to MLDB. You can check out the Using pymldb Tutorial for more details.


In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")

Basic usage examples

Let's start by writing a simple SQL query that will multiply an input number by 2 in JavaScript:


In [33]:
mldb.query("""
    SELECT
        jseval('
            return val * 2;
        ','val', 5) AS output
""")


Out[33]:
output
_rowName
result 10

The variable val takes the input value 5 and the code is then evaluated.

Our function can also take in multiple parameters as input, and return different output values:


In [35]:
mldb.query("""
    SELECT
        jseval('
            var output = {};
            output["mult"] = val * 2;
            output["name"] = str_val + " Hello!";
            return output;
        ','val,str_val', 5, 'Bonjour!') AS output
""")


Out[35]:
output.mult output.name
_rowName
result 10 Bonjour! Hello!

In the above example, the string val,str_val mean that the function takes 2 input variables. Those values will be 5 and the string Bonjour!. Since we return a JavaScript object, we essentially return a row where the keys are the objects' keys and the cell values are the object's values.

Now that we have the basics in place, let's continue to a real use-case below.

Formatting data during the import step

In the Loading Data From An HTTP Server Tutorial tutorial, we loaded a specific file from an archive that was located on the Stanford Network Analysis Project (SNAP) website.

The dataset contains all the circles of friends in which user no. 3980 is part of. Each row represents a circle of friends, and all the users that are part of that circle will be enumerated on the line.

Let's check's out the unformated version of the data first, by running the import.text procedure:


In [44]:
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"

mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
        "delimiter": " ", 
        "quoteChar": "",
        "outputDataset": "import_URL2",
        "runOnCreation": True
    }
})

mldb.query("SELECT * NAMED rowName() FROM import_URL2 LIMIT 10")


Out[44]:
"circle0 3989 4009"
_rowName
2 circle1\t4010\t4037
3 circle2\t4013
4 circle3\t4024\t3987\t4015
5 circle4\t4006
6 circle5\t4035
7 circle6\t3999\t4028\t4005\t3996\t4031\t4018\t3...
8 circle7\t3984
9 circle8\t3988\t4030\t4026\t4021
10 circle9\t3983\t3992\t4033\t4017\t4000\t3986
11 circle10\t3990\t4007\t4016\t4025

We see that each line contains the circle number followed by user ids. This type of data is an ideal candidate for MLDB, since we can store it as bags of words, or rather, bags of friends. A dataset of type sparse.mutable can store sparse representations like this one very efficiently.

Normally, we could use the tokenize function to deal with data like this. However, since splitting the data on the <TAB> character yields a variable number of columns, the standard way of importing this won't work very nicely in the import.text procedure.

In the code below, we will use the jseval function to do the following in JavaScript:

  • create an empty object
  • split each line on the <TAB> character
  • store the first element of each line under the key rowName in the object (circle0, circle1, etc...)
  • store all remaining elements of the line using the element's name as the key, and the number 1 as the value

In [2]:
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"

print mldb.put("/v1/procedures/import_non_formated", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
        "headers": ["circles"],
        "select": """
            jseval('
                var row_val = val.split("\t");
                var rtn = {};
                rtn["rowName"] = row_val[0];
                for(i=1; i<row_val.length; i++) {
                    rtn[row_val[i]] = 1;
                }
                return rtn;
                ','val', circles) AS *
        """,
        "outputDataset": {
            "id": "import_non_formated",
            "type": "sparse.mutable"
        },
        "runOnCreation": True
    }
})


<Response [201]>

We can now run a SELECT query on the resulting dataset and get a nice sparse representation:


In [3]:
mldb.query("""
    SELECT * EXCLUDING(rowName)
    NAMED rowName
    FROM import_non_formated 
    ORDER BY CAST(rowName() AS integer) 
    LIMIT 5
""")


Out[3]:
3989 4009 4010 4037 4013 3987 4015 4024 4006
_rowName
circle0 1 1 NaN NaN NaN NaN NaN NaN NaN
circle1 NaN NaN 1 1 NaN NaN NaN NaN NaN
circle2 NaN NaN NaN NaN 1 NaN NaN NaN NaN
circle3 NaN NaN NaN NaN NaN 1 1 1 NaN
circle4 NaN NaN NaN NaN NaN NaN NaN NaN 1

We can now answer a simple question like: Is there any friend of user 3980 that appears in more than one of his circle of friends? It can be answered with the following query:


In [13]:
mldb.query("""
    SELECT *
    FROM transpose(
        (
            SELECT sum({* EXCLUDING(rowName)}) as * 
            NAMED 'result'
            FROM import_non_formated
        )
    )
    ORDER BY result DESC
    LIMIT 5
""")


Out[13]:
result
_rowName
4030 1
4013 1
4020 1
4023 1
3999 1

Since the maximum value is 1, we now know that the answer to the above question is no.

Although there are other ways to obtain the same result, using jseval and the dataset of type sparse.mutable allowed us to transform our data in a single step, without knowing its characteristics in advance. This shows how much added flexibility is added by such a function.

Designing custom feature generators

Another very powerful way the jseval function can be used is as a feature generator. When trying to prototype and iterate quickly, this can be a very efficient way to try out new ideas.

Let's start by creating a toy dataset using the description of machine learning concepts from Wikipedia:


In [17]:
print mldb.put('/v1/procedures/import_ML_concepts', {
        "type":"import.text",
        "params": {
            "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
            "outputDataset": "ml_concepts",
            "named": "Concepts",
            "select": "Text",
            "runOnCreation": True
        }
    }
)


<Response [201]>

Taking a peek at our data, we see there is a single column called Text that contains a textual description of an ML concept:


In [23]:
mldb.query("SELECT * FROM ml_concepts")


Out[23]:
Text
_rowName
Artificial neural network In machine learning and cognitive science, art...
Autoencoder An autoencoder, autoassociator or Diabolo netw...
Hopfield network A Hopfield network is a form of recurrent arti...
Boltzmann machine Boltzmann machine is a type of stochastic recu...
Restricted boltzmann machines A restricted Boltzmann machine (RBM) is a gene...
Deep belief network In machine learning, a deep belief network (DB...
Logistic regression In statistics, logistic regression, or logit r...
Naive bayes classifier In machine learning, naive Bayes classifiers a...
Support vector machine In machine learning, support vector machines (...

Let's now create a function of type sql.expression containing a jseval function that calculates different statistics about the string it is given. It calculates things like the number of words in the string, the number of capital letters, etc.

Putting it in an sql.expression allows us to reuse it easily later on.


In [21]:
print mldb.put("/v1/functions/getStats", {
    "type": "sql.expression",
    "params": {
        "expression": """
            jseval(' 
                var result = {};

                result["len"] = txt.length;
                result["numWords"] = txt.split(" ").length;
                result["numCapital"] = txt.replace(/[^A-Z]/g, "").length;
                result["numExpl"] = txt.replace(/[^!]/g, "").length;
                result["numQst"] = txt.replace(/[^?]/g, "").length;
                result["containsHashSign"] = txt.replace(/[^#]/g, "").length >= 1;
                result["numNumbers"] = txt.replace(/[^0-9]/g, "").length;

                result["capitalProportion"] = result["numCapital"] / result["len"];
                result["explProportion"] = result["numExpl"] / result["len"];
                result["qstProportion"] = result["numQst"] / result["len"];
                result["numberProportion"] = result["numNumbers"] / result["len"];

                return result;
            ', 'txt', text) as stats
        """
    }
})


<Response [201]>

Now that we have created our getStats function, we can call it on a single string:


In [25]:
mldb.query("SELECT getStats({text: 'This is a test #hopethisworks #mldb'}) as *")


Out[25]:
stats.capitalProportion stats.containsHashSign stats.explProportion stats.len stats.numCapital stats.numExpl stats.numNumbers stats.numQst stats.numWords stats.numberProportion stats.qstProportion
_rowName
result 0.028571 true 0 35 1 0 0 0 6 0 0

Looks like it works! We can also call it on the Text column of our ml_concepts dataset to get the statistics for all the rows of our dataset:


In [22]:
mldb.query("SELECT getStats({text: Text}) as * FROM ml_concepts")


Out[22]:
stats.capitalProportion stats.containsHashSign stats.explProportion stats.len stats.numCapital stats.numExpl stats.numNumbers stats.numQst stats.numWords stats.numberProportion stats.qstProportion
_rowName
Artificial neural network 0.012469 false 0 401 5 0 1 0 60 0.002494 0
Autoencoder 0.010582 false 0 378 4 0 7 0 54 0.018519 0
Hopfield network 0.016461 false 0 486 8 0 10 0 73 0.020576 0
Boltzmann machine 0.015890 false 0 944 15 0 5 0 140 0.005297 0
Restricted boltzmann machines 0.020817 false 0 1249 26 0 14 0 180 0.011209 0
Deep belief network 0.021021 false 0 999 21 0 3 0 155 0.003003 0
Logistic regression 0.010979 false 0 1093 12 0 7 0 158 0.006404 0
Naive bayes classifier 0.016049 false 0 1620 26 0 19 0 238 0.011728 0
Support vector machine 0.012709 false 0 1495 19 0 2 0 236 0.001338 0

Doing most of this is possible in standard SQL, but the jseval implementation is simple, fast and compact. This is a great way to quickly experiment with ideas and gives maximum flexibility to manipulate data.

Where to next?

Check out the other Tutorials and Demos.